Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for async f! and j! #229

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

charleskawczynski
Copy link
Member

@charleskawczynski charleskawczynski commented Oct 10, 2023

@tapios suggested adding this if it's not too difficult, and here it is:

This PR adds support for asynchronous f! and j! on both CPU and GPU.

We can try it out by passing the comms context, but we don't need to use it if we don't want.

@tapios
Copy link

tapios commented Oct 10, 2023

Nice! When you know how much it affects performance, please let us know.

@simonbyrne
Copy link
Member

@charleskawczynski
Copy link
Member Author

charleskawczynski commented Oct 23, 2023

@charleskawczynski
Copy link
Member Author

Interesting, we're hitting maxiter reached in saturation_adjustment:. Ah, right, it was reduced to 3. 🤷🏻

@charleskawczynski
Copy link
Member Author

I guess, maybe, we need a smaller timestep? (we could try increasing maxiter, but I assume it was reduced for good reason).

@charleskawczynski
Copy link
Member Author

charleskawczynski commented Feb 29, 2024

Update on this: the failures in ClimaAtmos were because the device was not synchronized, adding proper syncs did fix this.

I've updated the PR in a couple ways:

  • The multiple streams version is now in a CUDA extension, so that we don't need to directly depend on CUDA
  • I've kept both the multiple streams and multiple tasks for CUDADevices (we should compare them)
  • I've added a multithreaded version for the CPU

We should compare the streams vs tasks on the CUDA side. The multiple streams version with events do seem to be executing T_exp! and T_lim! asynchronously on separate streams, however, are not executing concurrently. The reason being that we're probably utilizing too many resources, and the scheduler is deciding to execute them serially.

One way we can try getting concurrent execution is by reducing the number of blocks in our kernel launches, but we might only want to (somehow) do that in T_exp! and T_lim!, rather than globally.

So, all in all, this isn't immediately offering a performance improvement on GPUs yet. The CPU multithreaded implementation might be an improvement, but it's not yet been tested/benchmarked. Also, there could be interplay with threads being used in the broadcast kernels (which, perhaps, we could/should just remove).

@charleskawczynski
Copy link
Member Author

charleskawczynski commented Mar 25, 2024

Further update: we will not add async support with T_exp! and T_lim!, as fusing these allowed us to eliminate dss calls. Instead, we'll add async support with f! and j!.

I've updated the PR to reflect this.

@charleskawczynski charleskawczynski changed the title Add support for async T_exp! and T_lim! Add support for async f! and j! Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants